Optimize LenVM guided sampling latency #3
Open
namezhenzhang wants to merge 1 commit into
ChangyiYang
added a commit
to ChangyiYang/Length-Value-Model
that referenced
this pull request
May 24, 2026
When the requested (mode, scale) is mathematically equivalent to vanilla sampling -- centered_exp/value_bias scale=0, mul scale<=0 or scale==1, or other expectation modes at scale==1 -- _req_wants_value_guidance now returns False so the entire LVM path is skipped for that request.

The paper's neutral config (centered_exp scale=0) is exactly such a no-op tilt: exp(0 * value) == 1 for all candidates, so the LVM forward runs but contributes nothing to token selection. Skipping it should drop wall_clock from ~72 s back to vanilla (~20 s) at scale=0, with zero change to token-selection behavior.

Hard-constraint requests (target_value/target_length/value_constraint/cmp/op) always need LVM, regardless of scale. Per-req result is cached on the Req via the existing _lvm_wants_guidance slot.

Adds _get_req_value_mode_and_scale helper that caches (mode, scale) parsing per req, keyed on custom_params identity + entries.

Co-Authored-By: Zhen Zhang <namezhenzhang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
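The no-op rules in this commit message can be sketched as a small predicate. This is a minimal illustration, not the code in the commit: the function names, the mode groupings, and the `has_hard_constraint` flag are hypothetical, and every mode not named in the message is assumed to be an expectation mode that is neutral at scale 1.

```python
import math

# Hypothetical groupings that follow the commit message wording.
ADDITIVE_MODES = {"centered_exp", "value_bias"}  # no-op when scale == 0
MULTIPLICATIVE_MODES = {"mul"}                   # no-op when scale <= 0 or scale == 1


def wants_value_guidance(mode: str, scale: float, has_hard_constraint: bool = False) -> bool:
    """Return False when the request is equivalent to vanilla sampling."""
    if has_hard_constraint:
        # Hard-constraint requests always need the LVM, regardless of scale.
        return True
    if mode in ADDITIVE_MODES:
        return scale != 0
    if mode in MULTIPLICATIVE_MODES:
        return not (scale <= 0 or scale == 1)
    # Assumption: any other mode is an expectation mode, neutral at scale == 1.
    return scale != 1


def tilt_weights(values, scale):
    """Exponential tilt factors exp(scale * value), one per candidate."""
    return [math.exp(scale * v) for v in values]


# At scale 0 every candidate gets weight 1, so the tilt cannot change the sampling distribution.
neutral = tilt_weights([-3.0, 0.5, 7.0], 0.0)
```

In a real scheduler the boolean would be computed once and cached on the request, as the commit message describes, so the check costs nothing on later decode steps.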
ChangyiYang
added a commit
to ChangyiYang/Length-Value-Model
that referenced
this pull request
May 24, 2026
Cherry-picked from upstream PR UCSB-AI#3 (namezhenzhang):

* lvm_value_utils.get_eos_token_ids: cache result on req._lvm_eos_token_ids; was re-walking sets every candidate filter call.
* qwen2_lvm.py / qwen3_lvm.py / qwen2_5_vl_lvm.py: drop two forward_batch.{extend_seq_lens,extend_prefix_lens}.tolist() calls per LVM forward by carrying tree_value_cached_prefix_lens in the TreeValueSpecInput. With ~1300 LVM forwards in the paper run, that removes ~2600 GPU->CPU syncs.
* tree_value_spec.py: vectorize the candidate self-attention diagonal using numpy fancy indexing instead of a per-token Python loop.

No algorithm change; same forward outputs.

Co-Authored-By: Zhen Zhang <namezhenzhang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
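The diagonal vectorization mentioned above can be illustrated with a small NumPy sketch. The mask layout is an assumption made for illustration (one row per candidate, columns covering the prefix followed by the candidates), and both function names are hypothetical; the actual layout in tree_value_spec.py may differ.

```python
import numpy as np


def diagonal_loop(mask: np.ndarray, prefix_len: int, num_candidates: int) -> np.ndarray:
    """Per-token Python loop: candidate i may attend to its own column."""
    out = mask.copy()
    for i in range(num_candidates):
        out[i, prefix_len + i] = True
    return out


def diagonal_vectorized(mask: np.ndarray, prefix_len: int, num_candidates: int) -> np.ndarray:
    """Same result with one fancy-indexed assignment."""
    out = mask.copy()
    rows = np.arange(num_candidates)
    out[rows, prefix_len + rows] = True
    return out


# Assumed layout: 3 candidates, 2 prefix tokens, every candidate sees the whole prefix.
base = np.zeros((3, 5), dtype=bool)
base[:, :2] = True
loop_mask = diagonal_loop(base, prefix_len=2, num_candidates=3)
vec_mask = diagonal_vectorized(base, prefix_len=2, num_candidates=3)
```

Both versions produce the same mask; the vectorized one replaces a Python-level loop with a single indexed write, which matters when the mask is rebuilt on every LVM forward.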
ChangyiYang
added a commit
to ChangyiYang/Length-Value-Model
that referenced
this pull request
May 24, 2026
…CSB-AI#3)

lvm_inproc_runner.py:
- eval_candidates_batch_gpu now accepts candidate_lens_per_req so the runner can skip re-walking candidate_ids_per_req lists.
- Add extend_and_eval_candidates_batch_gpu(): in steady-state decode each request adds at most 1 prefix token, so we can run prefix-extend + candidate-scoring in ONE forward (Q_len = 1 + k) with the right tree mask. Returns None for unsupported configurations (VLM, page_size != 1, or any per-req prefix delta > 1 token) so callers can fall back.

lvm_guided_sampling.py:
- PendingLvmResult gains candidate_lens_send so build_pending can pass per-row valid counts down without recomputing.
- _Inproc adds tree_value_extend_and_launch_gpu wrapper that schedules the fused kernel on lvm_stream.
- apply() GPU path now tries the fused kernel first; on None fall-back runs the original two-phase tree_value_extend + tree_value_launch_gpu.
- timer.set_meta(lvm_fused_path=int(fused)) so we can quantify how often fusion actually triggers.

Our build_pending CPU/sync trim + vectorized scatter + per-req cache are all preserved; the fused path is opt-in on the GPU branch only.

Co-Authored-By: Zhen Zhang <namezhenzhang@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
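The try-fused-then-fall-back control flow described in this commit can be sketched as follows. All function names here are hypothetical stand-ins, and the returned strings only mark which path ran; the eligibility conditions are the ones listed in the commit message.

```python
from typing import List, Optional, Tuple


def try_fused_extend_and_eval(prefix_deltas: List[int], page_size: int, is_vlm: bool) -> Optional[str]:
    """Stand-in for the fused launch: returns None for unsupported configurations."""
    if is_vlm or page_size != 1 or any(delta > 1 for delta in prefix_deltas):
        return None
    return "fused"


def two_phase_extend_then_eval() -> str:
    """Stand-in for the original prefix-extend forward followed by candidate scoring."""
    return "two_phase"


def score_candidates(prefix_deltas: List[int], page_size: int = 1, is_vlm: bool = False) -> Tuple[str, int]:
    """Try the fused path first; fall back when it declines the batch."""
    result = try_fused_extend_and_eval(prefix_deltas, page_size, is_vlm)
    fused = result is not None
    if not fused:
        result = two_phase_extend_then_eval()
    # The int flag mirrors the lvm_fused_path metadata recorded by the timer.
    return result, int(fused)
```

Returning None instead of raising keeps the fallback cheap on the hot path and lets the caller record how often fusion actually triggers.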
Summary
Follow-up to the LenVM timing analysis in PR #2. This PR optimizes the in-process LenVM guided sampling hot path and adds a written latency summary in docs/lenvm-guided-sampling-optimization.md.

Main changes:
- … centered_exp scale 0.
- … [batch, vocab] probability tensor.
- … .tolist() GPU syncs in Qwen2/Qwen3/Qwen2.5-VL LenVM value slicing by carrying prefix/candidate metadata in the tree-value spec.
- … SGLANG_LVM_TIMING.

Benchmark Summary
Reference config: 1x H100 SXM, Qwen/Qwen2.5-7B-Instruct + namezz/lvm-math-0402-a-qwen2.5-7b-instruct-b-qwen2.5-1.5b-instruct, GSM8K 50 questions x 16 samples, max_tokens=6000, T=1.0, top-p=1.0, min-p=0.01. Baseline uses top_k=-1; LenVM uses top_k=5, value_mode=centered_exp, value_scale=0.001, gamma=0.997.

Compared with the PR #2 reference result, 19.22 s -> 87.44 s (4.55x slower), this PR reduces the guided run wall clock by 8.5%, the slowdown ratio by 11.4%, and the incremental LenVM overhead by 11.8%.
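The three percentages are relative to different quantities derived from the PR #2 reference pair. The sketch below works through that arithmetic; the `est_` values are back-calculated from the stated reductions, so they are approximations rather than measured results, and all variable names are illustrative.

```python
# PR #2 reference numbers quoted above.
baseline_s = 19.22  # vanilla sampling wall clock
guided_s = 87.44    # LenVM guided wall clock

slowdown = guided_s / baseline_s    # guided / baseline ratio
overhead_s = guided_s - baseline_s  # incremental LenVM cost in seconds

# Apply each stated relative reduction to the quantity it refers to.
est_guided_s = guided_s * (1 - 0.085)
est_slowdown = slowdown * (1 - 0.114)
est_overhead_s = overhead_s * (1 - 0.118)

print(f"reference: {slowdown:.2f}x slower, {overhead_s:.2f} s overhead")
print(f"estimated: {est_guided_s:.1f} s guided, {est_slowdown:.2f}x, {est_overhead_s:.1f} s overhead")
```

The overhead reduction is larger than the wall-clock reduction because the baseline portion of the guided run is not affected by LenVM optimizations.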
Model memory from the same run: base 7B bf16 weights 14.30 GB; LenVM 1.5B bf16 weights 3.03 GB.
Profiling (SGLANG_LVM_TIMING=1, q10/n4/topk5/scale=0.001) shows baseline Sampler.forward at about 0.70-0.77 ms. With LenVM active, Sampler.forward is about 13-15 ms after warmup and LvmGuidedSampler.apply accounts for about 97% of it. The remaining bottleneck is still rows that fall back to the two-forward LenVM path: prefix extend plus candidate launch.

Artifacts from the Slurm runs:
- results/timing/pr7b_q50n16_s001_84065_20260523_211347/
- results/timing/pr7b_prof_s001_84067_20260523_212103/

Tests
- .venv-infer/bin/python -m compileall -q ... on modified SGLang files and focused tests
- git diff --check
- … .venv-infer/bin/python because this workspace does not currently have pytest installed:
  - test_lvm_guided_sampling_fast_path.py: 7 tests passed
  - test_tree_value_spec.py: 2 tests passed

Follow-up
The fused path is still often all-or-nothing at the batch level. The next high-value optimization is to split a batch into fusible and fallback rows, then run fused extend+candidate scoring for eligible rows and keep the two-phase path only for the rest.
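The proposed row-level split could look roughly like the sketch below. It is not part of this PR: the function name is hypothetical, and the eligibility test uses only the per-request prefix delta condition from the fused-path commit (at most 1 new prefix token), ignoring the batch-wide VLM and page-size checks.

```python
from typing import List, Tuple


def split_fusible_rows(prefix_deltas: List[int]) -> Tuple[List[int], List[int]]:
    """Partition batch row indices into fused-eligible rows and fallback rows."""
    fusible: List[int] = []
    fallback: List[int] = []
    for row, delta in enumerate(prefix_deltas):
        # Assumption: a row is fusible when it adds at most one prefix token.
        (fusible if delta <= 1 else fallback).append(row)
    return fusible, fallback


# Rows 2 and 4 added more than one prefix token, so only they take the two-phase path.
fusible_rows, fallback_rows = split_fusible_rows([1, 0, 3, 1, 2])
```

With such a split, a single slow row no longer forces the whole batch onto the two-forward path; the results from the two sub-batches would then be scattered back into the original row order.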